Learning Word Vectors for 157 Languages
Authors
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov
Abstract
Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora and then use the pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets, for French, Hindi and Polish, to evaluate these word vectors. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
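For readers who want to try these vectors, the sketch below loads one of the pre-trained models with the official fastText Python bindings and runs a nearest-neighbour query and an analogy query of the kind the new datasets test. The cc.fr.300.bin file name follows the naming convention of the models distributed at https://fasttext.cc; network access and local disk space are assumptions.

```python
# Minimal sketch: loading the pre-trained vectors with the official
# fastText Python bindings (pip install fasttext).
import fasttext
import fasttext.util

# Fetch the French model if absent (assumption: network access and several
# GB of disk space for the .bin file).
fasttext.util.download_model('fr', if_exists='ignore')
model = fasttext.load_model('cc.fr.300.bin')

vec = model.get_word_vector('bonjour')              # 300-dimensional vector
print(model.get_nearest_neighbors('bonjour', k=5))

# An analogy query of the kind the new French dataset tests:
# 'paris' - 'france' + 'pologne' should be close to 'varsovie'.
print(model.get_analogies('paris', 'france', 'pologne', k=3))
```

The same models are also distributed as plain-text .vec files, which can be read by any tool that accepts the word2vec text format.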
Similar Resources
Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at the word, sentence, or at least document level. We hypothesize, however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analy...
The Effect of Mnemonic Key Word Method on Vocabulary Learning and Long Term Retention
Most of the studies on the key word method of second/foreign language vocabulary learning have been based on evidence from laboratory experiments and have primarily involved the use of English key words to learn the vocabularies of other languages. Furthermore, comparatively few such studies have been done in authentic classroom contexts. The present study inquired into the eff...
Codeswitching language identification using Subword Information Enriched Word Vectors
Codeswitching is a widely observed phenomenon among bilingual speakers. By combining subword-information-enriched word vectors with a linear-chain Conditional Random Field, we develop a supervised machine learning model that identifies languages in English-Spanish codeswitched tweets. Our computational method achieves a tweet-level weighted F1 of 0.83 and a token-level accuracy of 0.949 without...
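As an illustration of the technique this abstract names, the sketch below feeds fastText subword-enriched vectors into a linear-chain CRF. The library choice (sklearn-crfsuite) and the feature layout are assumptions made for illustration, not the paper's own implementation.

```python
# Minimal sketch: subword-enriched word vectors as real-valued features
# for a linear-chain CRF that labels each token with a language.
import fasttext
import sklearn_crfsuite

ft = fasttext.load_model('cc.en.300.bin')  # any subword-aware model works

def token_features(token):
    # CRFsuite accepts numeric feature values, so each vector component
    # becomes one weighted feature alongside the lowercased surface form.
    vec = ft.get_word_vector(token.lower())
    feats = {f'v{i}': float(x) for i, x in enumerate(vec)}
    feats['lower'] = token.lower()
    return feats

# Toy training data: one tweet with token-level language labels.
X = [[token_features(t) for t in ['I', 'love', 'tacos', 'muy', 'mucho']]]
y = [['en', 'en', 'en', 'es', 'es']]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```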
Words are not Equal: Graded Weighting Model for Building Composite Document Vectors
Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models usually omit stop words from the composition. Instead of suc...
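The word-vector-averaging baseline this abstract mentions is easy to make concrete. The sketch below composes a document vector as a frequency-weighted average, down-weighting frequent words rather than omitting them; the inverse-frequency weighting scheme is an illustrative assumption, not the graded weighting model the paper proposes.

```python
# Minimal sketch: a document vector as a weighted average of word vectors.
import numpy as np

# Toy embeddings and corpus frequencies; in practice these would come from
# a pre-trained model and corpus counts.
emb = {'the': np.array([0.1, 0.0]), 'movie': np.array([0.7, 0.2]),
       'was': np.array([0.0, 0.1]), 'great': np.array([0.9, 0.8])}
freq = {'the': 0.05, 'movie': 0.001, 'was': 0.03, 'great': 0.0005}

def doc_vector(tokens, a=1e-3):
    # Down-weight frequent (stop-like) words instead of dropping them.
    weights = np.array([a / (a + freq[t]) for t in tokens])
    vecs = np.stack([emb[t] for t in tokens])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(doc_vector(['the', 'movie', 'was', 'great']))
```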
Generating the Pseudo-Powers of a Word
The notions of the power of a word, periodicity, and primitivity are intrinsically connected to the operation of catenation, which dynamically generates word repetitions. When considering generalizations of the power of a word, other operations are the ones that dynamically generate such pseudo-repetitions. In this paper we define and investigate the operation of θ-catenation, which gives rise to the...
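A pseudo-power can be made concrete with a small example. In the sketch below, θ is taken to be an antimorphic involution (the DNA reverse complement, a standard choice in this literature), and the pseudo-powers of a word t are enumerated as words built from blocks t and θ(t); the θ-catenation operation itself is defined in the paper and is not reproduced here.

```python
# Minimal sketch: enumerating pseudo-powers of a word under an antimorphic
# involution θ (here: DNA reverse complement).
from itertools import product

COMP = str.maketrans('ACGT', 'TGCA')

def theta(w):
    # Antimorphic involution: complement each letter, then reverse.
    return w.translate(COMP)[::-1]

def pseudo_powers(t, n):
    # All words built from n blocks, each block being t or θ(t).
    return {''.join(blocks) for blocks in product((t, theta(t)), repeat=n)}

print(theta('ACG'))                    # 'CGT'
print(sorted(pseudo_powers('ACG', 2))) # 4 pseudo-squares of 'ACG'
```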
Journal: CoRR
Volume: abs/1802.06893
Publication year: 2018